Figuring out a Fair Price of a Used Car in a Data Science Way
The whole journey of using DS methods to calculate a fair price for a used car.
Introduction
What is the ordinary way of figuring out the price of a used car? You search for similar vehicles, estimate a rough baseline price and then fine-tune it depending on the current mileage, color, number of options, etc. You use both domain knowledge and an analysis of the current market state.
If you go deeper, you may consider selling the car in a different region of the country where the average price is higher. You may even investigate how long cars stay listed in the catalog and detect overpriced samples to make a more informed decision.
So there is a lot to think about, and the question I faced here was: “Is it possible that data science methods (collecting and cleaning the data, training ML models, etc.) can save you time and mental effort in a painful decision-making process?”
I opened a laptop, created a new project and turned the timer on.
Stage 1. Collecting the data
Without going into too much detail: I managed to collect a dataset containing roughly 40,000 car ads with 35 features (mostly categorical) in two days. Collecting the data itself wasn’t too painful, but structuring it in an organized way took a bit of time. I used Python, Requests, Pandas, NumPy, SciPy, etc.
What is interesting about this particular dataset is that most of the categorical features are not encoded in any way and thus can be easily interpreted (like engine_fuel = “diesel”).
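Just to give a flavor of that step, here is a minimal sketch of what the collection script can look like. The endpoint, pagination scheme and JSON field names below are made up for the example; they are not the actual source I scraped.

```python
import requests
import pandas as pd

# Hypothetical listing endpoint and JSON layout, for illustration only.
BASE_URL = "https://car-catalog.example/api/ads"

def fetch_page(page):
    """Download one page of ads and keep only the fields we care about."""
    response = requests.get(BASE_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    return [
        {
            "manufacturer_name": ad.get("manufacturer_name"),
            "model_name": ad.get("model_name"),
            "year_produced": ad.get("year_produced"),
            "odometer_value": ad.get("odometer_value"),
            "engine_fuel": ad.get("engine_fuel"),
            "price_usd": ad.get("price_usd"),
        }
        for ad in response.json()["ads"]
    ]

# Collect a few pages and flatten them into one tidy DataFrame.
rows = [row for page in range(1, 6) for row in fetch_page(page)]
df = pd.DataFrame(rows)
df.to_csv("cars.csv", index=False)
```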
Stage 2. Looking at the big picture and dealing with bad data
Initial data analysis quickly revealed suspicious samples: 8-million-kilometer odometer readings, hatchbacks with 10-liter engines, hybrid diesel vehicles for $600, etc. I spent roughly 6 hours writing scripts to detect and process these issues.
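A sketch of what such a sanity-check script can look like is below. The thresholds and some column names (engine_capacity, body_type, engine_fuel) are assumptions for the example, not the exact rules I ended up with.

```python
import pandas as pd

df = pd.read_csv("cars.csv")

# Flag physically implausible samples instead of silently dropping them,
# so every rule can be reviewed before anything is removed.
suspicious = (
    (df["odometer_value"] > 1_000_000)                                       # multi-million-km odometers
    | ((df["engine_capacity"] > 7.0) & (df["body_type"] == "hatchback"))     # 10-liter hatchbacks
    | ((df["engine_fuel"] == "hybrid-diesel") & (df["price_usd"] < 1_000))   # $600 hybrid diesels
)
print(f"{suspicious.sum()} suspicious samples out of {len(df)}")
clean_df = df[~suspicious].copy()
```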
Visualizing the data (I used Matplotlib and Seaborn) gave me a good sense of the overall market situation.
The majority of the cars are pretty heavily used, with a mean odometer_value of 250,000 kilometers, and that is a lot! I also noticed that people prefer to assign nice round numbers to odometer_value, like 250,000 km, 300,000 km, 350,000 km, etc. A bunch of cars have an odometer_value of one million kilometers, which does not make much sense if you look at the distribution of values. I presume that 1 million kilometers is more of a statement: “This car has seen a lot; I honestly don’t know the exact number of kilometers on it.”
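For example, plotting the odometer_value distribution takes just a few lines (the binning and styling below are arbitrary, and clean_df is the cleaned dataset from the previous step):

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
sns.histplot(clean_df["odometer_value"], bins=60)
plt.axvline(clean_df["odometer_value"].mean(), color="red", linestyle="--", label="mean")
plt.xlabel("odometer_value, km")
plt.legend()
plt.show()
```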
The general trend behind car pricing is pretty intuitive: the older the car, the lower the price. I expected the age of the car to be the number one feature in the overall feature hierarchy.
Also, the older the car, the higher its odometer_value in general, and that is reasonable.
To build the price_usd scatter plot I limited the maximum car price to roughly $50,000 and removed several million-kilometer odometer_value outliers.
Cars priced below $50,000 actually constitute 99.9% of the catalog, so the scatter plot gives a good sense of the pricing trend.
Regarding car age: most of the cars have been in use for a while, with a mean year_produced of 2002. I believe the distribution of production years (depicted below) was heavily influenced by customs duty policies for importing cars from abroad.
The price distribution (price_usd is going to be the target variable during model training) is highly skewed to the right, with a mean of $7,275 and a median of $4,900.
Some features, like up_counter (the number of times an ad has been manually promoted), don’t reflect parameters of the car at all, but since this data was available, I decided to include it in the project. The distribution was so skewed that the only way to plot it properly was to use a log scale.
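One way to do that (assuming a log-scaled count axis is what tames the tail) is simply:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 4))
sns.histplot(clean_df["up_counter"], bins=60)
plt.yscale("log")  # log-scaled counts make the long right tail readable
plt.xlabel("up_counter")
plt.show()
```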
The distribution of brand popularity wasn’t a surprise to me, with the most popular model in the catalog being the VW Passat, the legendary source of transportation in Belarus.
I also used Tableau to get a nicer visual representation of each manufacturer’s market share and average price.
The shape of the number_of_photos distribution is similar to the price_usd distribution (skewed to the right).
Maybe the higher the price of a car, the higher the number of photos?
I made a joint plot, which shows a slight correlation, but more importantly, it clearly shows that the majority of cars are cheap and have fewer than 15 photos.
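A joint plot like that can be produced with a single Seaborn call; the hexbin style below is my choice for the sketch, since 40,000 raw points would overplot badly:

```python
import seaborn as sns

sns.jointplot(
    data=clean_df,
    x="number_of_photos",
    y="price_usd",
    kind="hex",   # hexbin handles tens of thousands of points better than a raw scatter
    height=8,
)
```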
Some features, like drivetrain, were just interesting to explore. In the histogram below you can see how the percentage of rear-wheel drive cars has decreased over the last 30 years.
The full correlation matrix for the dataset is depicted below (most of the features are self-explanatory, except feature_0 … feature_9: these are boolean columns which indicate that the car has features like alloy wheels, an air conditioner, etc.).
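Building such a matrix is straightforward with pandas and Seaborn; a minimal sketch, restricted to the numeric and boolean columns:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = clean_df.select_dtypes(include=["number", "bool"]).corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation matrix")
plt.show()
```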
I’m not going to publish the full exploratory analysis here; you can check it out in the kernel. I spent approximately six hours digging into the data and engineering features (and I would need 60 more hours to properly fix all the problematic samples in the dataset), and only then did I move on to model training.
Stage 3. Model training
Since I had already cleaned the dataset and applied some feature engineering with future machine learning in mind, building and training a baseline model was a breeze.
To get maximum results with the least effort I used CatBoost, a gradient boosting on decision trees library with comprehensive categorical feature support, developed and open sourced by Yandex. I had already spent too much time on the project, so I just threw the data into the model, tweaked the learning rate, tree depth and the number of trees in the ensemble, trained several models and started exploring model decision making with SHAP (developed by Scott Lundberg et al.). Total time spent: 4 hours.
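A baseline training run along those lines might look like the sketch below; the hyperparameter values are illustrative, not the exact ones I settled on:

```python
import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split

df = pd.read_csv("cars_clean.csv")
target = "price_usd"
cat_features = [c for c in df.columns if df[c].dtype == "object"]  # raw string categories

X, y = df.drop(columns=[target]), df[target]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = CatBoostRegressor(
    iterations=7000,      # illustrative values: these are the knobs I tweaked,
    learning_rate=0.05,   # not necessarily the winning combination
    depth=8,
    loss_function="MAE",
    verbose=500,
)
model.fit(
    Pool(X_train, y_train, cat_features=cat_features),
    eval_set=Pool(X_val, y_val, cat_features=cat_features),
)
```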
Fun fact: during the initial prediction exploration phase I was disappointed with the performance of the model, started exploring the mistakes and found out that the price column in my dataset had been parsed incorrectly: some prices were in US dollars and some in the national currency of Belarus, BYN. I fixed the parser code, gathered the data again, reran the cleaning, feature engineering and analytics jobs, and retrained the models with much better results.
To train and evaluate the first model I filtered out cars with prices above $30,000 (during the exploration phase I discovered that these samples need a separate model).
Stage 4. Model evaluation
I had neither the time nor the computational resources to run proper grid search and sequential feature selection (SFS) jobs, so I just tweaked the number of trees and the learning rate several times and used 5-fold cross-validation to estimate the performance of the model (check the full kernel).
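Roughly, that cross-validation setup looks like this with CatBoost’s cv utility (X, y and cat_features are the same objects as in the training sketch above, and the parameter values are again illustrative):

```python
from catboost import Pool, cv

params = {
    "loss_function": "MAE",
    "learning_rate": 0.05,
    "depth": 8,
    "iterations": 10000,
}
cv_results = cv(
    pool=Pool(X, y, cat_features=cat_features),
    params=params,
    fold_count=5,                # 5-fold cross-validation
    early_stopping_rounds=200,   # stop when the validation MAE stops improving
)
print(cv_results[["iterations", "test-MAE-mean", "test-MAE-std"]].tail())
```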
The first decent CatBoost model got me to approximately $1,000 MAE (mean absolute error), which is roughly 15% of the mean value of the price_usd target. To be exact about the scoring:
Best validation MAE score: $1019.18 on iteration 6413 with std $12.84
I also used the early_stopping_rounds parameter to stop training once the validation score stops improving. Without trying to improve the model any further, I moved on to prediction analysis.
Looking at the distribution of errors and 2D histogram plots of true vs. predicted values does not tell you much about the quality of predictions.
The hierarchy of SHAP values for the features in the dataset (only the top 20 features are displayed) didn’t seem surprising to me, apart from the low position of the odometer_value feature. You can find a good explanation of non-parametric model interpretation in articles like this.
Age of the car, brand, type of the body, engine capacity and drivetrain are at the top of the hierarchy and that seems very reasonable.
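For reference, the summary plot can be produced roughly like this (a sketch that reuses the trained model and validation data from earlier; SHAP’s TreeExplainer accepts CatBoost models directly):

```python
import shap
from catboost import Pool

explainer = shap.TreeExplainer(model)
val_pool = Pool(X_val, y_val, cat_features=cat_features)
shap_values = explainer.shap_values(val_pool)

# Global feature importance hierarchy, top 20 features only.
shap.summary_plot(shap_values, X_val, max_display=20)
```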
The state feature turned out to be an interesting one: the vast majority of cars are “owned”, but there is a small percentage of “new” cars (which are expensive) and a number of “emergency” vehicles, which are damaged.
The problem with damaged cars is obvious: the column is boolean and there is no degree of this “emergency state” (more on this later).
Stage 5. Exploring individual predictions using domain knowledge and figuring out limitations of the model
Exploring individual predictions quickly gave me a sense of both the model’s and the data’s limitations. I’ll shortlist some of the samples to give a better sense of the model’s decision-making process.
Sample 0: VW T5 Caravelle, 2009, mechanical gearbox, 287,000 km, diesel. Listed price is $13,600. The prediction is lower by $1,200 (the MAE for the model is about $1,000, so that is a rather typical case).
You can see that the Volkswagen brand and the minivan body_type each contribute to pushing the predicted price of this sample higher.
Using SHAP we can plot the decision-making interpretation in an even nicer way by using the “decision plot” functionality.
Decision plots were recently added to the library and provide an even more detailed view of a model’s inner workings.
The main benefit of decision plots compared to force plots is that they are able to showcase a larger number of features clearly.
You can read more about this type of plot here.
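A single-sample decision plot can be drawn roughly like this (again a sketch reusing the explainer and validation data from above; the first validation row stands in for one of the samples discussed below):

```python
import shap
from catboost import Pool

# SHAP values for one sample; expected_value is the model's baseline prediction,
# and the decision plot shows how each feature pushes the prediction away from it.
sample = X_val.iloc[:1]
sample_shap = explainer.shap_values(Pool(sample, cat_features=cat_features))
shap.decision_plot(explainer.expected_value, sample_shap, sample)
```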
Sample 1: Mercedes-Benz E270, 2000, mechanical gearbox, 465,000 km, diesel. Listed price is $4,999. The prediction is higher by $198. That is not bad at all!
Sample 2: Jeep Grand Cherokee, 2007, automatic gearbox, 166,000 km, diesel. Listed price is $14,500. The prediction is lower by $2,796. A poor prediction at first glance, but this car has been listed in the catalog for 498 days! It feels like the price for this particular sample was set way too high. It is also listed in the poorest region of the country, where cars are cheaper in general and stay listed for longer.
Sample 3: VW Passat, 2012, automatic gearbox, 102,000 km, gasoline. Listed price is $11,499. The prediction is lower by $64.
Sample 4: VAZ 2107 (a Russian car), 1987, mechanical gearbox, 120,000 km, gasoline. Listed price is $399. The prediction is higher by $34. That is one from the lower end of the price spectrum.
We can see how basically all feature values of this sample contribute negatively to the predicted price.
Sample 5: VW Passat, 1992, mechanical gearbox, 398,000 km, gasoline. Listed price is $750. The prediction is higher by $721 (almost twice the listed value)! Why?
If we look closer at the model interpretation, we can see that state = emergency is an important contributor to the predicted price. Further manual investigation of this particular case revealed that the car had been damaged by a falling tree.
That is clearly a limitation of the existing data: a boolean state column simply can’t reflect the full spectrum of damage levels. I believe this issue could be fixed “easily” by applying two more mechanisms: image analysis with some kind of pre-trained deep CNN and entity extraction from the sample description using RNNs.
I will end this sample selection with a luxury BMW 3-series.
Sample 6: BMW 316, 1994, mechanical gearbox, 320,000 km, gasoline. Listed price is $1,650. The prediction is higher by $55. We can clearly see that being a luxury brand gives some points to the sample.
I spent roughly 3 hours digging into the model’s predictions and manually exploring the samples.
Technical conclusion
I got approximately $1,000 MAE using a CatBoostRegressor on the whole dataset. But I also tried the same approach with separate models and immediately halved the error to $500. I believe the performance would improve even further if we split the dataset into sub-datasets based on the year_produced feature and trained multiple models on them.
I’ve also been thinking that the duration_listed feature could be used to penalize sample weights in the dataset. For example, if a car has been listed for a year, the price is probably set too high, so we can lower the weight of that sample using CatBoost’s pooling functionality.
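A minimal sketch of that idea (reusing X, y and cat_features from the training sketch, with a made-up weighting rule that would need tuning):

```python
import numpy as np
from catboost import Pool

# Illustrative weighting: the longer an ad has been listed, the less we trust
# its price; the floor of 0.1 keeps every weight strictly positive.
weights = np.clip(1.0 - df["duration_listed"] / 365.0, 0.1, 1.0)

weighted_pool = Pool(
    X, y,
    cat_features=cat_features,
    weight=weights,   # per-sample weights supported by CatBoost's Pool
)
```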
In general I feel like the model I’ve trained performs reasonably well, but there is plenty of room for further improvement.
Regarding the technologies used in this project: CatBoost seems to be exactly the right choice because it provides great out-of-the-box categorical feature support. Training with 5-fold cross-validation took about 13 minutes on a 2019 MacBook Pro, and even if I had millions of samples in the dataset, that wouldn’t be an issue because CatBoost supports training on GPUs. It also has very fast prediction times, which helps when models move into production.
Overall conclusion
The question I’ve tried to answer in this adventure is: “Can using data science methods be justified if you are going to sell your own car?” The obvious answer is no: you would make much better progress figuring out the price by just manually searching for similar vehicles online. It took me several days to complete the project even in a simple way, and there is so much more I could do to improve the performance of the model, such as proper feature selection, grid search, etc.
At the same time, the answer is definitely yes if you have hundreds or thousands of vehicles to sell. Using Python and its rich ecosystem of data science packages, you can automate the jobs of gathering data, building analytics and training predictive models. You can also discover trends and forecast the future of the market. You can hide models behind APIs so they can serve the business in a reliable and convenient way.
And there is so much more. If you run a business that imports used cars from abroad, you can train separate ML models to rank samples based on their profitability for the company. You can also forecast how long these samples are going to be listed before the actual deal. You can even choose the right region for the sale. The possibilities are unlimited if you just have enough data.
But if you are just doing this kind of side project for yourself, you are going to have great fun, like I did!